library(tidyverse)
data <- read.csv("data.csv")

Data Description

This dataset comes from the UC Machine Learning Repository Mice Protein Expression. It consists of expression levels of 77 proteins modifications that produced signals in the nuclear fraction of the cortex. The dataset includes measurements from 38 control mice and 34 trisomic mice (Down syndrome), as discussed in the study by @doi:10.1371/journal.pone.0129126.

data

What are the columns represent?

Mouse ID: Unique identifier for each mouse, 15 measurements were taken for each protein per mouse.

Expression Levels: The dataset contains values for the expression levels of 77 proteins.

Genotype: The genotype of each mouse, either control (c) or trisomy (t).

Treatment Type: The treatment administered to the mouse, either memantine (m) or saline (s).

Behavior: Describes the behavioral context of each mouse, either context-shock (CS) or shock-context (SC).

Eight Classes

The mice are classified into eight groups based on features such as genotype, behavior, and treatment.

blue_palette <- colorRampPalette(c("darkblue", "lightblue"))(8)

ggplot(data, aes(class)) +
  geom_bar(aes(fill = class), alpha = 0.8) +
  scale_fill_manual(values = blue_palette) +
  labs(title = "Number of Classes", y = "Count", x="Class")

  • c-CS-s: Control mice, stimulated to learn, injected with saline (9 mice)

  • c-CS-m: Control mice, stimulated to learn, injected with memantine (10 mice)

  • c-SC-s: Control mice, not stimulated to learn, injected with saline (9 mice)

  • c-SC-m: Control mice, not stimulated to learn, injected with memantine (10 mice)

  • t-CS-s: Trisomic mice, stimulated to learn, injected with saline (7 mice)

  • t-CS-m: Trisomic mice, stimulated to learn, injected with memantine (9 mice)

  • t-SC-s: Trisomic mice, not stimulated to learn, injected with saline (9 mice)

  • t-SC-m: Trisomic mice, not stimulated to learn, injected with memantine (9 mice)

Two Genotype

The genotypes can either be control or trisomic.

ggplot(data, aes(x = Genotype, fill = class)) +
  geom_bar(position = "dodge") +
  scale_fill_manual(values = blue_palette) +
  labs(title = "Classes for feature: Genotype", x = "Class", y = "Frequency") +
  theme_minimal() +
  theme(legend.position = c(1.01, 0.8), legend.justification = c(0, 0.5))

Two Behavior

In terms of behavior, some mice have been stimulated to learn (context-shock), while others have not (shock-context).

ggplot(data, aes(x = Behavior, fill = class)) +
  geom_bar(position = "dodge") +
  scale_fill_manual(values = blue_palette) +
  labs(title = "Classes for feature: Behavior", x = "Class", y = "Frequency") +
  theme_minimal() +
  theme(legend.position = c(1.01, 0.8), legend.justification = c(0, 0.5))

Treatment and Control

In order to assess the effect of the drug memantine on the ability to learn in trisomic mice, some mice were injected with the drug, and others were injected with saline as control.

ggplot(data, aes(x = Treatment, fill = class)) +
  geom_bar(position = "dodge") +
  scale_fill_manual(values = blue_palette) +
  labs(title = "Classes for feature: Treatment", x = "Class", y = "Frequency") +
  theme_minimal() +
  theme(legend.position = c(1.01, 0.8), legend.justification = c(0, 0.5))

Data Cleaning and handle missing values

The dataset contains some missing values. In this case, we will handle the missing values by replacing them with the mean of the corresponding protein group.

missing_values <- data %>%
  setNames(gsub("_N", "", names(.))) %>%
  summarise(across(everything(), ~ sum(is.na(.)), .names = "{.col}")) %>%
  pivot_longer(everything(), names_to = "key", values_to = "missing") %>%
  mutate(
    total = nrow(data),
    pct = (missing / total) * 100,
    isna = missing > 0
  ) %>%
  filter(isna) %>%
  arrange(desc(pct))

percentage.plot <- missing_values %>%
  ggplot(aes(x = reorder(key, pct), y = pct)) +
  geom_bar(stat = 'identity', fill = 'blue', alpha = 1) +
  scale_x_discrete(limits = missing_values$key) +
  coord_flip() +
  labs(title = "Missing Values In the Dataset", x = 'Protein Name', y = "Percentage of Missing Values") +
  theme(plot.title = element_text(hjust = 0.5)) + 
  guides(fill = "none")
percentage.plot

data <- data %>%
  group_by(class) %>%
  mutate(across(everything(), ~ replace(., is.na(.), mean(., na.rm = TRUE)))) %>%
  as.data.frame()

names(data) <- gsub("_N", "", names(data))
proteins <- names(data[2:78])
classes <- as.vector(unique(as.character(data$class)))

Data Visualization

Histogram for Expression Levels

This histogram displays the distribution of protein expression levels across all samples. It shows the frequency of expression values within specified intervals (bins), allowing for an understanding of the overall data distribution. The shape of the histogram reveals patterns such as skewness, central tendency, and the spread of expression levels, providing insights into how proteins are expressed in the dataset. This visualization helps identify any potential trends, outliers, or clusters in the expression levels.

for(i in 1:length(proteins)){
  plot <- ggplot(data, aes(eval(parse(text = proteins[i])))) +
    geom_histogram(aes(fill=..count..), color = "black", alpha = 0.5) +
    scale_fill_gradient("Count", low="blue", high="black") +
    labs(title = proteins[i],
         x = "Protein Expression Level",
         y = "Frequency") +
    theme_light()
  cat("#### ", proteins[i], "\n")
  print(plot)
  cat('\n\n')
}

DYRK1A

ITSN1

BDNF

NR1

NR2A

pAKT

pBRAF

pCAMKII

pCREB

pELK

pERK

pJNK

PKCA

pMEK

pNR1

pNR2A

pNR2B

pPKCAB

pRSK

AKT

BRAF

CAMKII

CREB

ELK

ERK

GSK3B

JNK

MEK

TRKA

RSK

APP

Bcatenin

SOD1

MTOR

P38

pMTOR

DSCR1

AMPKA

NR2B

pNUMB

RAPTOR

TIAM1

pP70S6

NUMB

P70S6

pGSK3B

pPKCG

CDK5

S6

ADARB1

AcetylH3K9

RRP1

BAX

ARC

ERBB4

nNOS

Tau

GFAP

GluR3

GluR4

IL1B

P3525

pCASP9

PSD95

SNCA

Ubiquitin

pGSK3B_Tyr216

SHH

BAD

BCL2

pS6

pCFOS

SYP

H3AcK18

EGR1

H3MeK4

CaNA

Boxplot for Distribution of Protein Expression Level by Class

This boxplot illustrates the distribution of protein expression levels across different classes. Each box represents the range of expression values for a specific protein, showing the median, interquartile range (IQR), and potential outliers. The plot allows for easy comparison of protein expression patterns between classes, highlighting variations and trends in the data. By visualizing the spread and central tendency of expression levels, this boxplot helps identify proteins with significantly different expression across the classes.

for(i in 1:length(proteins)){
  plot <- ggplot(data, aes(x = class, y = eval(parse(text = proteins[i])))) +
    geom_boxplot(aes(fill = class), alpha = 0.8) +
    stat_summary(fun.y=mean, colour="black", geom="point", size=1.5) +
    labs(y = proteins[i]) +
    scale_fill_manual(values = blue_palette)
  cat("#### ", proteins[i], "\n")
  print(plot)
  cat('\n\n')
}

DYRK1A

ITSN1

BDNF

NR1

NR2A

pAKT

pBRAF

pCAMKII

pCREB

pELK

pERK

pJNK

PKCA

pMEK

pNR1

pNR2A

pNR2B

pPKCAB

pRSK

AKT

BRAF

CAMKII

CREB

ELK

ERK

GSK3B

JNK

MEK

TRKA

RSK

APP

Bcatenin

SOD1

MTOR

P38

pMTOR

DSCR1

AMPKA

NR2B

pNUMB

RAPTOR

TIAM1

pP70S6

NUMB

P70S6

pGSK3B

pPKCG

CDK5

S6

ADARB1

AcetylH3K9

RRP1

BAX

ARC

ERBB4

nNOS

Tau

GFAP

GluR3

GluR4

IL1B

P3525

pCASP9

PSD95

SNCA

Ubiquitin

pGSK3B_Tyr216

SHH

BAD

BCL2

pS6

pCFOS

SYP

H3AcK18

EGR1

H3MeK4

CaNA

Principal Component Analysis (PCA) of Sample Data by Class

This PCA plot visualizes the distribution of samples from eight distinct classes based on the first two principal components. Each point represents a sample, and the colors indicate the class to which the sample belongs. The plot provides insights into how the samples are grouped and their variation along the principal components, which can be useful for identifying patterns or clustering trends within the data.

library(ggfortify)
library(plotly)

df <- data[2:78]  
pca_res <- prcomp(df, scale. = TRUE, rank. = 2)

pca_plot <- autoplot(pca_res, data = data, colour = 'class', frame = TRUE, frame.type = 'norm')

interactive_pca <- ggplotly(pca_plot, tooltip = "text")

interactive_pca <- interactive_pca %>%
  layout(
    title = "PCA Analysis of Protein Expression by Class", 
    hovermode = "closest",
    margin = list(
      l = 80,  
      r = 80,  
      b = 80,  
      t = 80
 ))

interactive_pca

ANOVA and Tukey’s HSD Test for Expression Levels

This analysis involves examining differences in expression levels across genotypes using ANOVA, followed by Tukey’s Honest Significant Difference (HSD) test for pairwise comparisons. Due to file size limit, this part is now moved to separate file.

Classification of Mice by Protein Expressions

Option 1: Decision Tree

Based on the result the decision tree has a really bad fit and low accuracy. Due to the file size limit this code and results is now moved to separate python file on git.

library(caret)
library(rpart)
library(randomForest)

data$class <- as.factor(data$class)
X <- data[ , -which(names(data) %in% c("class"))]
y <- data$class

m <- ncol(data) 
dt_model <- train(class ~ ., data = na.omit(data[ , -((m-3):(m-1))]), method = 'rpart', 
                  trControl = trainControl(method = "cv", number = 10))

Option 2: Random Forest

It appears that tuning mtry to 48 provides the most stable and accurate model performance in this grid search. Due to the file size limit this code and results is now moved to separate python file on git.

Result and Discussion

Proteins Discriminating Learning Success in Control Mice

The comparison of control mice that successfully learned (c-CS-s and c-CS-m) to those that did not undergo training (c-SC-s and c-SC-m) identified several proteins significantly associated with learning. A total of 31 proteins showed significant differences between the groups of mice that underwent successful learning versus those that did not. Proteins such as BRAF, ERK, and pERK, components of the MAPK signaling pathway, were identified. This pathway is well-established in its role in synaptic plasticity and learning. Immediate early gene proteins like EGR1 and BDNF were found to be elevated in the successful learning groups.

Proteins Associated with Normal vs. Failed Learning

When comparing successful learning with saline treatment (c-CS-s) to the non-stimulated control mice (c-SC-s), there were 24 proteins that were significantly different, including DYRK1A, SOD1, and ITSN1, all of which were previously implicated in learning and memory functions. But in Ts65Dn mice, learning failure (c-CS-s) was associated with a distinct protein signature, with proteins like APP and DYRK1A showing upregulated expression, which are known to be linked to cognitive deficits in Down syndrome.

Proteins Associated with Learning in Trisomic Mice

When comparing t-CS-s (failed learning) vs. t-SC-s (non-learning), there were 10 proteins with significant expression differences, including DYRK1A, ITSN1, pERK, BRAF, and SOD1. Some of these proteins were also altered in normal learning of control mice, indicating that even in failed learning, trisomic mice exhibited some normal protein responses.

When comparing t-CS-m (rescued learning) vs. t-SC-m (non-learning with memantine), 36 proteins were significantly different, with 17 showing similar changes to those in normal control mice, suggesting that memantine promotes more normal responses but not entirely. A similar comparison of t-CS-m vs. t-SC-s (non-learning with saline)identified 20 proteins, further showing how memantine rescued learning by altering protein profiles.

Memantine alone (t-SC-m vs. t-SC-s) influenced 12 proteins, such as pNR2A, pPKCAB, and ARC, some of which were also present in control mice. The comparison of rescued (t-CS-m) vs. failed learning (t-CS-s) showed only 9 significant proteins, indicating that while memantine could rescue some aspects of learning, it didn’t fully restore normal protein expression levels.